The data, as well as its structure, dimensions, and origins, have been described previously here, so we'll skip repeating the description of the data…


1 Graphical overview of data

#Read in the data
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", stringsAsFactors = TRUE, sep = ",",header = TRUE)

library(gtsummary)
#Print out a summary table of all variables
my_table <- tbl_summary(human,
                        digits = all_continuous() ~ 2,
                        type = all_continuous() ~ "continuous2",
                        statistic = list(all_continuous() ~ c("{mean} ({sd})",
                                                              "{median} ({p25}, {p75})", 
                                                              "{min}, {max}")))

#Add some missing elements to the table and print it out
my_table %>% bold_labels()
Characteristic        N = 155
Edu2.FM
  Mean (SD)           0.85 (0.24)
  Median (IQR)        0.94 (0.73, 1.00)
  Range               0.17, 1.50
Labo.FM
  Mean (SD)           0.71 (0.20)
  Median (IQR)        0.75 (0.60, 0.85)
  Range               0.19, 1.04
Edu.Exp
  Mean (SD)           13.18 (2.84)
  Median (IQR)        13.50 (11.25, 15.20)
  Range               5.40, 20.20
Life.Exp
  Mean (SD)           71.65 (8.33)
  Median (IQR)        74.20 (66.30, 77.25)
  Range               49.00, 83.50
GNI
  Mean (SD)           17,627.90 (18,543.85)
  Median (IQR)        12,040.00 (4,197.50, 24,512.00)
  Range               581.00, 123,124.00
Mat.Mor
  Mean (SD)           149.08 (211.79)
  Median (IQR)        49.00 (11.50, 190.00)
  Range               1.00, 1,100.00
Ado.Birth
  Mean (SD)           47.16 (41.11)
  Median (IQR)        33.60 (12.65, 71.95)
  Range               0.60, 204.80
Parli.F
  Mean (SD)           20.91 (11.49)
  Median (IQR)        19.30 (12.40, 27.95)
  Range               0.00, 57.50

1.0.1 Interpretation of the summary table


All variables are continuous and numerical. The scales and variances of the variables differ wildly, which means we should definitely scale the data before running PCA.

Based on the mean and median values, most variables have fairly symmetrical distributions. Some variables, such as GNI, Ado.Birth, and Mat.Mor, are somewhat right-skewed (mean well above median), and their standard deviations are also quite large.


1.1 Correlations


library(GGally)
#library(plotly)

#Print a scatter plot matrix
ggpairs(data = human)

#Correlation plot
ggcorr(data = human)


1.1.1 Interpretation of plots


Visual inspection confirms our numerical observations of the distributions’ skewness for some variables. Also, quite strong correlation can be seen in the data for some variables.

For example, there is considerable negative correlation for Mat.Mor vs GNI, Life.Exp, Edu.Exp, and Edu2.FM. Similar negative correlations exist for Ado.Birth.

Ado.Birth and Mat.Mor, as well as Edu.Exp and Life.Exp, have strong (>0.7) positive correlations.

Not all correlation seen in the data is necessarily linear: for example, GNI and Edu.Exp demonstrate a seemingly non-linear relationship. Moreover, there is heteroscedasticity in the data (for example GNI vs Edu.Exp), which suggests that some variables might benefit from a log-transform before running the PCA. Alas, such considerations were not instructed for this assignment.
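For completeness, here is a hedged sketch of what such a log-transform could look like; the choice of which variables to transform is my own assumption based on the skewness observed above, and this is not part of the assignment:

```r
#Hypothetical pre-processing (not required by the assignment):
#log-transform the right-skewed, strictly positive variables
#before centering, scaling, and PCA
human_log <- human
skewed <- c("GNI", "Mat.Mor", "Ado.Birth")
human_log[, skewed] <- log(human_log[, skewed])
my_log_pca <- prcomp(human_log, center = TRUE, scale. = TRUE)
```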


2 Principal component analysis


Time for some principal component analysis. First, let's run the analysis on unscaled data.


2.1 Before scaling


#Do PCA using SVD
my_unscaled_pca <- prcomp(human)

my_unscaled_pca
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                     PC1           PC2           PC3           PC4           PC5
## Edu2.FM   -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM    2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03  0.0022190154
## Edu.Exp   -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02  0.1431180282
## Life.Exp  -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02  0.9865644425
## GNI       -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05 -0.0001135863
## Mat.Mor    5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03  0.0266373214
## Ado.Birth  1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03  0.0188618600
## Parli.F   -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
##                     PC6           PC7           PC8
## Edu2.FM    2.180183e-02  6.998623e-01  7.139410e-01
## Labo.FM    3.264423e-02  7.132267e-01 -7.001533e-01
## Edu.Exp    9.882477e-01 -3.826887e-02  7.776451e-03
## Life.Exp  -1.453515e-01  5.380452e-03  2.281723e-03
## GNI       -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor    1.695203e-03  1.355518e-04  8.371934e-04
## Ado.Birth  1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F   -2.309896e-02 -2.642548e-03  2.680113e-03
summary(my_unscaled_pca)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
#Calculate the PC labels with the percentages of variance for the biplots
variance_percentage <- round(100*summary(my_unscaled_pca)$importance[2, ], digits = 3)
variance_percentage <- paste0(names(variance_percentage), " (", variance_percentage, " %)")

#Print the biplot as requested
biplot(my_unscaled_pca, cex = c(0.6, 1), col = c("grey40", "red"), xlab = variance_percentage[1], ylab = variance_percentage[2])

#What on earth is meant by "captions ... in your plots" in the instructions
#is lost on me but here's how you can add (confusing) text to your plot:
text(x=-170000,y=-28000, labels = paste0("Caption: GNI has by far the greatest\n",
                                  "contribution to the unscaled variance in the\n",
                                  "data. Indeed, PC1 (aligned with GNI) explains\n",
                                  "99.99% of the variance in the data."), cex = 0.75,col="blue")


2.2 After scaling


#Scale the data
human_scaled <- scale(human)

#Do PCA using SVD
my_scaled_pca <- prcomp(human_scaled)

my_scaled_pca
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## Labo.FM    0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## Edu.Exp   -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## Life.Exp  -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## GNI       -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor    0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth  0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## Parli.F   -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                   PC6         PC7         PC8
## Edu2.FM    0.17713316  0.05773644  0.16459453
## Labo.FM   -0.03500707 -0.22729927 -0.07304568
## Edu.Exp   -0.38606919  0.77962966 -0.05415984
## Life.Exp  -0.42242796 -0.43406432  0.62737008
## GNI        0.11120305 -0.13711838 -0.16961173
## Mat.Mor    0.17370039  0.35380306  0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F    0.13749772  0.00568387 -0.02306476
summary(my_scaled_pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
#Calculate the PC labels with the percentages of variance for the biplots
variance_percentage <- round(100*summary(my_scaled_pca)$importance[2, ], digits = 3)
variance_percentage <- paste0(names(variance_percentage), " (", variance_percentage, " %)")

#Print the biplot as requested
biplot(my_scaled_pca, cex = c(0.6, 1), col = c("grey40", "red"), xlab = variance_percentage[1], ylab = variance_percentage[2])
text(x=-5,y=12.5, labels = paste0("Caption: After scaling, PC1 explains only\n",
                                  "53.6% of variance in the data. Variables\n",
                                  "whose arrows align with each other are correlated.\n",
                                  "Similarly, a variable's alignment with the PC1 or PC2 axis\n",
                                  "signifies its contribution to that component.\n"), cex = 0.75,col="blue")


2.3 Interpretation of PCA results


2.3.1 General considerations of PCA

In layman's terms, principal component analysis (PCA) is a method whereby as much of the variance in multivariate data as possible is distilled into as few degrees of freedom (principal components) as possible using linear combinations of the variables. In visual terms, a principal component can be described as an axis through the multivariate data, onto which all observations are projected. Each principal component is derived by aligning this axis with the direction of greatest remaining variance in the data. This variance is then "removed" from the data before the calculation of subsequent principal components (i.e. the principal components are orthogonal to each other). Thus, after the first principal component, each subsequent principal component can describe only as much or less variance than the previous one. In the theoretical case of perfectly spherical data (uncorrelated variables with equal variances), each principal component would be equally important for describing the data. The number of principal components is at most equal to the number of variables in the data; otherwise the PCA would not be able to describe all of the variance in the data.
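The description above can be made concrete in base R: `prcomp()` computes the PCA via a singular value decomposition (SVD) of the centered data matrix, so the two should agree (up to the sign of each loading column). A small sketch:

```r
#PCA via SVD "by hand": the right singular vectors of the centered data
#matrix are the loadings, and the singular values divided by sqrt(n - 1)
#are the component standard deviations
X <- scale(human, center = TRUE, scale = FALSE)
s <- svd(X)
s$d / sqrt(nrow(X) - 1)  #should match prcomp(human)$sdev
s$v                      #should match prcomp(human)$rotation up to column signs
```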

2.3.2 Significance of scaling/normalizing variables

Since variances can differ wildly between variables, scaling the data is often necessary to make better sense of the results. For example, if one variable in the data has much higher variance than the others, the first principal component will always align with that variable. This result may be nonsensical, especially if the variable simply has a larger scale or higher measurement noise that explains the high variance. However, if the intention is simply to identify the axes with the highest variance in the data as measured, then scaling is counterproductive.

It should be noted that scaling the data essentially assigns “equal importance” for each variable in the PCA. Depending on the above considerations this may or may not make sense in your particular use case.
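As a side note, `prcomp()` can also do the standardization itself via its `scale.` argument; to my understanding this is equivalent to calling `scale()` first:

```r
#Two equivalent ways of running PCA on standardized data:
pca_a <- prcomp(scale(human))
pca_b <- prcomp(human, scale. = TRUE)
all.equal(pca_a$sdev, pca_b$sdev)
```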

3 PCA: Non-scaled vs scaled


In the case of our example data, the variance of GNI is numerically far greater than that of any other variable. Thus, before scaling, PC1 aligns almost entirely with GNI and, in this analysis, explains 99.99% of the variance in the data. This result makes no sense in our case, since the variables are measured on completely different scales.

After scaling, GNI has lost its disproportionately large significance, as can clearly be seen from the biplot.


4 Interpretation: PC1 and PC2


As discussed above, PC1 in the latter biplot is the multivariate dimension/axis in the scaled data along which most of the data's variance (53.6%, to be exact) lies. Thus, most of the variance in our data with 8 degrees of freedom can be described by a single dimension. For PC1, the most significant variable is Life.Exp, with Mat.Mor a close second. Since we scaled the data, we can also assess correlations by looking at the signs and magnitudes of the PC coefficients (i.e. how the arrows align in the biplot). Opposite-facing arrows imply negative correlations.

The second principal component is calculated after the variance described by PC1 is "subtracted" from the data. Thus, PC2 describes less of the multivariate variance in the data (16.2%, to be exact). Also, as seen in the graph, PC2 is mathematically orthogonal to PC1, so the two are wholly uncorrelated. The variables' significance for PC2 is interpreted in the same way as for PC1.
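These claims can be checked numerically from the `prcomp` object: the component scores are stored in `$x` and the loadings in `$rotation`. A quick sketch:

```r
#Scores of orthogonal components are (numerically) uncorrelated:
round(cor(my_scaled_pca$x[, 1:2]), digits = 10)

#Variables ranked by the magnitude of their PC1 loading
#(Life.Exp and Mat.Mor should top the list):
sort(abs(my_scaled_pca$rotation[, "PC1"]), decreasing = TRUE)
```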


4.1 Extracurricular activities

4.1.1 When to scale or not to scale the data before PCA


To clarify the significance of scaling the data before PCA, let’s do a practical thought exercise:

Let's say that we have a database of face images and we wish to know which features in an image optimally tell apart the faces of different individuals, so as to predict identity from subsequent face images. For simplicity's sake, let's assume we can measure only three features from the images: X1, X2, and X3. These variables may or may not have different variances and/or be correlated. Additionally, they may have different repeatability.

Now, let's further assume that computing these variables for a single face image is time-consuming and, even more importantly, comparing them across our vast database of face images is extremely resource-intensive. Thus, we wish to find a single one-dimensional combination/reduction of these variables that can best be used to distinguish one face from another. For this dimensionality reduction we wish to use PCA.

The question is: should we or should we not scale the variables X1, X2, and X3?

The correct answer is: it depends.

For example, let’s assume you know from previous research that these features are of roughly equal importance to telling apart the individuals based on a face image. In such a case, standardizing/scaling the data makes perfect sense.

As an opposite example, let's assume that you know the measurement uncertainty of X1, X2, and X3 from duplicate images of the same individuals. If you have already adjusted X1, X2, and X3 according to this uncertainty (for example, by dividing each variable by its intra-individual standard deviation) before running PCA, you should absolutely not scale these variables, since you are interested in capturing the maximum amount of inter-individual variance in the data with a single linear variable.
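A minimal sketch of this latter workflow, assuming hypothetical objects `face_data` (a matrix with columns X1, X2, X3) and `intra_sd` (the intra-individual standard deviations estimated from duplicate images); neither exists in this report:

```r
#Hypothetical: divide each feature by its intra-individual (measurement)
#standard deviation, then run PCA WITHOUT further scaling
face_adjusted <- sweep(face_data, 2, intra_sd, FUN = "/")
face_pca <- prcomp(face_adjusted, center = TRUE, scale. = FALSE)
```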


4.1.2 A simulated example of the face recognition dilemma


Let’s do an example analysis of the above face recognition dilemma with simulated data:


library(MASS)
library(psych)

#First let's create a table with two highly correlated variables and 
#one completely non-correlated variable with significantly higher variance

#Covariance matrix of simulated data:
#    X1: X2: X3:
#X1: 1   1   0
#X2: 1   1   0
#X3: 0   0   10
#
# X1 and X2 have an sd of 1 and a positive covariance
# X3 has an sd of sqrt(10) (about 3.16) and no covariance with either X1 or X2
#
# Matrix values as vector: c(1,1,0,1,1,0,0,0,10)
corr_matrix <- matrix(c(1, 1, 0, 1, 1, 0, 0, 0, 10), nrow = 3, ncol = 3)

#Means of zero for all variables
var_means <- c(0, 0, 0)

#Generate the multivariate normal data table:
test_data <- mvrnorm(n = 10000, mu = var_means, Sigma = corr_matrix, empirical = TRUE)

#Describe the simulated data:
describe(test_data)
##    vars     n mean   sd median trimmed  mad    min   max range  skew kurtosis
## X1    1 10000    0 1.00      0    0.00 1.01  -3.62  3.86  7.47 -0.01    -0.05
## X2    2 10000    0 1.00      0    0.00 1.01  -3.62  3.86  7.47 -0.01    -0.05
## X3    3 10000    0 3.16      0   -0.01 3.19 -11.87 11.95 23.82  0.02    -0.05
##      se
## X1 0.01
## X2 0.01
## X3 0.03
#Do the pca on the non-scaled data
my_simulated_pca <- prcomp(test_data)

#Print out 
my_simulated_pca
## Standard deviations (1, .., p=3):
## [1] 3.162278e+00 1.414214e+00 3.332001e-08
## 
## Rotation (n x k) = (3 x 3):
##                PC1           PC2           PC3
## [1,]  0.000000e+00 -7.071068e-01 -7.071068e-01
## [2,] -2.954859e-17 -7.071068e-01  7.071068e-01
## [3,] -1.000000e+00  2.089401e-17 -2.089401e-17
summary(my_simulated_pca)
## Importance of components:
##                           PC1    PC2       PC3
## Standard deviation     3.1623 1.4142 3.332e-08
## Proportion of Variance 0.8333 0.1667 0.000e+00
## Cumulative Proportion  0.8333 1.0000 1.000e+00
#Now do the pca after scaling data:
my_simulated_pca <- prcomp(scale(test_data))

#Print out 
my_simulated_pca
## Standard deviations (1, .., p=3):
## [1] 1.414214e+00 1.000000e+00 3.332001e-08
## 
## Rotation (n x k) = (3 x 3):
##                PC1           PC2           PC3
## [1,] -7.071068e-01 -1.942890e-16  7.071068e-01
## [2,] -7.071068e-01 -2.631457e-16 -7.071068e-01
## [3,]  3.234552e-16 -1.000000e+00  4.589946e-17
summary(my_simulated_pca)
## Importance of components:
##                           PC1    PC2       PC3
## Standard deviation     1.4142 1.0000 3.332e-08
## Proportion of Variance 0.6667 0.3333 0.000e+00
## Cumulative Proportion  0.6667 1.0000 1.000e+00

4.1.3 Interpretation of the example analysis


As can be seen from the example analysis, the results are completely different before and after scaling. Before scaling, PC1 aligns with X3 alone, which explains most of the variance in the data. However, if each variable is assigned equal importance (i.e. the variables are scaled before PCA), the combination of X1 and X2 (PC1 coefficients of -0.7 for both) is the best combination of the available variables for telling the "individuals" apart (= maximal variance along a single dimension).
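To close the loop on the dilemma: the single "face score" for new images would be the projection onto PC1, reusing the training data's centering and scaling parameters. A sketch using the simulated objects above:

```r
#Project new (simulated) observations onto PC1 of the scaled-data PCA,
#reusing the training data's centering and scaling parameters
train_center <- attr(scale(test_data), "scaled:center")
train_sd <- attr(scale(test_data), "scaled:scale")
new_faces <- mvrnorm(n = 5, mu = var_means, Sigma = corr_matrix)
new_scaled <- scale(new_faces, center = train_center, scale = train_sd)
face_score <- predict(my_simulated_pca, newdata = new_scaled)[, "PC1"]
```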


5 Multiple correspondence analysis


5.1 Exploration of dataset


#Load the FactoMineR package for MCA
library(FactoMineR)

#Load the dataset
data(tea)

#Explore the structure and dimensions.
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

5.1.1 Visual Exploration


Frankly, meaningful visual exploration of a dataset with 35 categorical variables, most of which have only 2 levels, is challenging (and a bit too much for a contingency table). In fact, MCA can perhaps be used to simplify this issue, but we'll get back to that later…

In the meanwhile, a table instead of a plot, IMHO, is the most meaningful option for displaying the data.


#Print out a table of all variables
my_table <- tbl_summary(tea,
                        digits = all_continuous() ~ 2,
                        type = all_continuous() ~ "continuous2",
                        statistic = list(all_continuous() ~ c("{mean} ({sd})",
                                                              "{median} ({p25}, {p75})", 
                                                              "{min}, {max}"),
                        all_categorical() ~ "{n} / {N} ({p}%)"))

#Add some missing elements to the table and print it out
my_table %>% bold_labels()
Characteristic                N = 300
breakfast
  breakfast                   144 / 300 (48%)
  Not.breakfast               156 / 300 (52%)
tea.time
  Not.tea time                131 / 300 (44%)
  tea time                    169 / 300 (56%)
evening
  evening                     103 / 300 (34%)
  Not.evening                 197 / 300 (66%)
lunch
  lunch                       44 / 300 (15%)
  Not.lunch                   256 / 300 (85%)
dinner
  dinner                      21 / 300 (7.0%)
  Not.dinner                  279 / 300 (93%)
always
  always                      103 / 300 (34%)
  Not.always                  197 / 300 (66%)
home
  home                        291 / 300 (97%)
  Not.home                    9 / 300 (3.0%)
work
  Not.work                    213 / 300 (71%)
  work                        87 / 300 (29%)
tearoom
  Not.tearoom                 242 / 300 (81%)
  tearoom                     58 / 300 (19%)
friends
  friends                     196 / 300 (65%)
  Not.friends                 104 / 300 (35%)
resto
  Not.resto                   221 / 300 (74%)
  resto                       79 / 300 (26%)
pub
  Not.pub                     237 / 300 (79%)
  pub                         63 / 300 (21%)
Tea
  black                       74 / 300 (25%)
  Earl Grey                   193 / 300 (64%)
  green                       33 / 300 (11%)
How
  alone                       195 / 300 (65%)
  lemon                       33 / 300 (11%)
  milk                        63 / 300 (21%)
  other                       9 / 300 (3.0%)
sugar
  No.sugar                    155 / 300 (52%)
  sugar                       145 / 300 (48%)
how
  tea bag                     170 / 300 (57%)
  tea bag+unpackaged          94 / 300 (31%)
  unpackaged                  36 / 300 (12%)
where
  chain store                 192 / 300 (64%)
  chain store+tea shop        78 / 300 (26%)
  tea shop                    30 / 300 (10%)
price
  p_branded                   95 / 300 (32%)
  p_cheap                     7 / 300 (2.3%)
  p_private label             21 / 300 (7.0%)
  p_unknown                   12 / 300 (4.0%)
  p_upscale                   53 / 300 (18%)
  p_variable                  112 / 300 (37%)
age
  Mean (SD)                   37.05 (16.87)
  Median (IQR)                32.00 (23.00, 48.00)
  Range                       15.00, 90.00
sex
  F                           178 / 300 (59%)
  M                           122 / 300 (41%)
SPC
  employee                    59 / 300 (20%)
  middle                      40 / 300 (13%)
  non-worker                  64 / 300 (21%)
  other worker                20 / 300 (6.7%)
  senior                      35 / 300 (12%)
  student                     70 / 300 (23%)
  workman                     12 / 300 (4.0%)
Sport
  Not.sportsman               121 / 300 (40%)
  sportsman                   179 / 300 (60%)
age_Q
  15-24                       92 / 300 (31%)
  25-34                       69 / 300 (23%)
  35-44                       40 / 300 (13%)
  45-59                       61 / 300 (20%)
  +60                         38 / 300 (13%)
frequency
  1/day                       95 / 300 (32%)
  1 to 2/week                 44 / 300 (15%)
  +2/day                      127 / 300 (42%)
  3 to 6/week                 34 / 300 (11%)
escape.exoticism
  escape-exoticism            142 / 300 (47%)
  Not.escape-exoticism        158 / 300 (53%)
spirituality
  Not.spirituality            206 / 300 (69%)
  spirituality                94 / 300 (31%)
healthy
  healthy                     210 / 300 (70%)
  Not.healthy                 90 / 300 (30%)
diuretic
  diuretic                    174 / 300 (58%)
  Not.diuretic                126 / 300 (42%)
friendliness
  friendliness                242 / 300 (81%)
  Not.friendliness            58 / 300 (19%)
iron.absorption
  iron absorption             31 / 300 (10%)
  Not.iron absorption         269 / 300 (90%)
feminine
  feminine                    129 / 300 (43%)
  Not.feminine                171 / 300 (57%)
sophisticated
  Not.sophisticated           85 / 300 (28%)
  sophisticated               215 / 300 (72%)
slimming
  No.slimming                 255 / 300 (85%)
  slimming                    45 / 300 (15%)
exciting
  exciting                    116 / 300 (39%)
  No.exciting                 184 / 300 (61%)
relaxing
  No.relaxing                 113 / 300 (38%)
  relaxing                    187 / 300 (62%)
effect.on.health
  effect on health            66 / 300 (22%)
  No.effect on health         234 / 300 (78%)

Nothing much to interpret here. Frequencies and proportions are printed as requested.


5.2 Running the MCA analysis


5.2.1 Forming a research question


So what are we running a multiple correspondence analysis for? MCA can be viewed as analogous to PCA; however, it is a generalization for categorical variables. Continuous variables can be projected onto the dimensions (analogous to PCA's principal components) created by the MCA. Thus, even continuous variables can be contrasted with different patterns of categorical variables.

As such, the use cases of MCA are also analogous to those of PCA: reducing multivariate categorical data to as few significant degrees of freedom (in this case, patterns in the categorical variables) as possible. Similar to the variable coefficients of principal components in PCA, we can use the MCA to study associations between variable categories.
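As an aside, FactoMineR supports such projection of continuous variables directly: the `quanti.sup` argument of `MCA()` takes column indices of supplementary quantitative variables, which are projected onto the dimensions without influencing their construction. A hedged sketch (age is column 19 of tea):

```r
#Sketch: MCA on the first 18 tea-consumption variables with age projected
#as a supplementary quantitative variable (it does not affect the dimensions)
mca_with_age <- MCA(tea[, 1:19], quanti.sup = 19, graph = FALSE)
mca_with_age$quanti.sup  #coordinates of age on the MCA dimensions
```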

For a valid research question, let's say we're working at a tea shop and wish to enhance our understanding of our clientele. More specifically, we wish to identify possible patterns in habits of tea consumption.

To identify possible patterns in tea consumption habits, we include in our MCA the first 18 variables (questions related to tea consumption) and the demographic factors sex, SPC, and age_Q.


#Drop unnecessary variables:
tea2 <- tea[c(1:18, 20:21, 23)]

#Let's do the MCA analysis. 
my_mca <- MCA(tea2, graph = FALSE)

#Let's print a summary of our MCA
summary(my_mca)
## 
## Call:
## MCA(X = tea2, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.133   0.128   0.101   0.091   0.079   0.070   0.068
## % of var.              7.323   7.074   5.596   5.037   4.380   3.859   3.736
## Cumulative % of var.   7.323  14.397  19.993  25.031  29.411  33.269  37.005
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.064   0.061   0.061   0.057   0.055   0.054   0.052
## % of var.              3.544   3.391   3.348   3.152   3.036   2.987   2.871
## Cumulative % of var.  40.549  43.940  47.288  50.441  53.477  56.464  59.335
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.050   0.049   0.045   0.044   0.044   0.043   0.040
## % of var.              2.760   2.719   2.488   2.445   2.409   2.370   2.198
## Cumulative % of var.  62.095  64.814  67.302  69.746  72.156  74.526  76.724
##                       Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27  Dim.28
## Variance               0.037   0.035   0.034   0.033   0.032   0.029   0.028
## % of var.              2.053   1.937   1.864   1.815   1.766   1.592   1.538
## Cumulative % of var.  78.777  80.714  82.577  84.393  86.158  87.750  89.288
##                       Dim.29  Dim.30  Dim.31  Dim.32  Dim.33  Dim.34  Dim.35
## Variance               0.026   0.025   0.025   0.022   0.020   0.020   0.018
## % of var.              1.461   1.405   1.356   1.196   1.113   1.099   0.969
## Cumulative % of var.  90.749  92.154  93.510  94.706  95.818  96.918  97.887
##                       Dim.36  Dim.37  Dim.38
## Variance               0.014   0.013   0.011
## % of var.              0.793   0.707   0.613
## Cumulative % of var.  98.680  99.387 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1             | -0.478  0.574  0.093 |  0.233  0.141  0.022 |  0.024  0.002
## 2             | -0.263  0.174  0.051 |  0.290  0.219  0.062 | -0.313  0.323
## 3             |  0.071  0.013  0.002 | -0.131  0.044  0.008 |  0.168  0.093
## 4             | -0.608  0.930  0.242 | -0.274  0.195  0.049 |  0.163  0.087
## 5             | -0.310  0.242  0.084 | -0.046  0.005  0.002 |  0.073  0.017
## 6             | -0.687  1.186  0.227 | -0.242  0.153  0.028 |  0.053  0.009
## 7             | -0.002  0.000  0.000 |  0.079  0.016  0.005 | -0.032  0.003
## 8             | -0.083  0.017  0.005 |  0.217  0.122  0.031 | -0.228  0.171
## 9             |  0.213  0.114  0.026 |  0.438  0.500  0.110 |  0.076  0.019
## 10            |  0.326  0.267  0.053 |  0.561  0.820  0.158 |  0.188  0.117
##                 cos2  
## 1              0.000 |
## 2              0.072 |
## 3              0.013 |
## 4              0.017 |
## 5              0.005 |
## 6              0.001 |
## 7              0.001 |
## 8              0.034 |
## 9              0.003 |
## 10             0.018 |
## 
## Categories (the 10 first)
##                  Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test  
## breakfast     |  0.179  0.550  0.029  2.966 | -0.007  0.001  0.000 -0.111 |
## Not.breakfast | -0.165  0.507  0.029 -2.966 |  0.006  0.001  0.000  0.111 |
## Not.tea time  | -0.548  4.718  0.233 -8.348 |  0.059  0.057  0.003  0.900 |
## tea time      |  0.425  3.657  0.233  8.348 | -0.046  0.044  0.003 -0.900 |
## evening       |  0.245  0.738  0.031  3.057 | -0.275  0.965  0.040 -3.437 |
## Not.evening   | -0.128  0.386  0.031 -3.057 |  0.144  0.505  0.040  3.437 |
## lunch         |  0.645  2.192  0.071  4.623 | -0.371  0.749  0.024 -2.656 |
## Not.lunch     | -0.111  0.377  0.071 -4.623 |  0.064  0.129  0.024  2.656 |
## dinner        | -0.877  1.935  0.058 -4.160 |  0.218  0.124  0.004  1.034 |
## Not.dinner    |  0.066  0.146  0.058  4.160 | -0.016  0.009  0.004 -1.034 |
##                Dim.3    ctr   cos2 v.test  
## breakfast     -0.126  0.356  0.015 -2.088 |
## Not.breakfast  0.116  0.329  0.015  2.088 |
## Not.tea time   0.284  1.656  0.062  4.323 |
## tea time      -0.220  1.283  0.062 -4.323 |
## evening        0.344  1.906  0.062  4.295 |
## Not.evening   -0.180  0.996  0.062 -4.295 |
## lunch         -0.105  0.076  0.002 -0.755 |
## Not.lunch      0.018  0.013  0.002  0.755 |
## dinner         1.046  3.603  0.082  4.963 |
## Not.dinner    -0.079  0.271  0.082 -4.963 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.029 0.000 0.015 |
## tea.time      | 0.233 0.003 0.062 |
## evening       | 0.031 0.040 0.062 |
## lunch         | 0.071 0.024 0.002 |
## dinner        | 0.058 0.004 0.082 |
## always        | 0.034 0.035 0.066 |
## home          | 0.006 0.000 0.026 |
## work          | 0.101 0.038 0.078 |
## tearoom       | 0.384 0.015 0.000 |
## friends       | 0.207 0.060 0.015 |
#Print a plot of the first two dimensions:
plot(my_mca, invisible = c("ind"), cex = 0.7)


5.3 MCA interpretation


First off, looking at our eigenvalue table, the data seem quite homogeneous. This is reflected by the fact that the first two dimensions of the MCA explain only 14.4% of the total variance in the data. In absolute terms this is not much more than what we would expect per dimension in random data (100% / 38 dimensions ≈ 2.6%). In relative terms, however, it is still quite a bit better.

The tables for categories and individuals are interpreted in the same fashion: the Dim.n columns give the row's coordinate on each dimension; ctr is the row's percentage contribution to the variance of that dimension; cos2 is the squared correlation of the row with the dimension; and v.test is a normalized test statistic for the row's coordinate on the dimension. Had we used supplementary variables in our analysis, a similar table would have been printed for them.
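The same quantities can also be pulled out of the MCA object programmatically, which is handy for sorting and filtering:

```r
#Eigenvalues and percentages of variance per dimension:
head(my_mca$eig)

#Category coordinates, contributions (ctr), and squared correlations (cos2):
head(my_mca$var$coord)
head(my_mca$var$contrib)
head(my_mca$var$cos2)
```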

From the MCA factor plot we might be tempted to make the following deductions: 60+ individuals tend to drink our upscale (unpackaged) tea (at a tea shop). Similarly, a (probably correct) inference is that 15-24-year-olds are more often students. But…


5.3.1 The devil is in the detail!


HOWEVER, interpreting the results of MCA is more complex than that: the above deductions are not all correct!

Let’s illustrate this point by doing an alternative MCA with the same data:

#Choose variables that were seemingly associated with drinking the "upscale" tea
#(We, as a tea shop, wish to maximize our profits!)
tea3 <- tea[(colnames(tea) %in% c("how", "age_Q", "where", "Tea", "SPC","price"))]

#MCA
my_mca <- MCA(tea3, graph = FALSE)

#Let's skip the text output and just print the plot:
plot(my_mca, invisible = c("ind"), cex = 0.7)


5.3.1.1 Interpretation


Looking at the first plot, we would have been tempted to say that, yes, indeed, 60+ individuals know their tea and mostly drink "upscale" tea. In fact, on dimension 1, being 60+ is somewhat opposed to drinking "cheap" tea. Yet, looking at the plot from our second MCA, quite the opposite seems true: being 60+ is suddenly associated with drinking cheap tea! What gives?

This discrepancy is caused mainly by the fact that each of our dimensions captures only a fairly small amount of the variance in the data. As such, we should be considering more than just 2 dimensions at once. This, however, brings us back to the original dilemma with our data: we would have liked to reduce the dimensions so that we could visualize the data. Unfortunately, MCA did not reliably accomplish this task in this scenario…
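For what it's worth, further dimension pairs can at least be inspected with the `axes` argument of `plot.MCA`:

```r
#Plot dimensions 3 and 4 of the MCA instead of the default 1 and 2
plot(my_mca, axes = c(3, 4), invisible = c("ind"), cex = 0.7)
```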


5.4 A critique of MCA

(… and why it should probably not have been taught as part of this course.)

I must emphasize that interpreting MCA is quite challenging and fraught with pitfalls that can lead to entirely wrong conclusions about the data (as demonstrated above) if due diligence is not exercised. Thus, any correlations or patterns observed or perceived based on an MCA should be confirmed by examining the source data. IMHO, MCA can be useful in only limited cases for getting a general understanding of patterns in multivariate categorical data. Similarly, using MCA for deriving a continuous reduction of categorical data is a seemingly fringe use case (although, admittedly, I'm no expert in the field).

For example, in our hypothetical case of a tea shop, the truly significant research question would probably have been more focused, e.g. "on which categories of tea drinkers should we focus our advertising to maximize our sales of the most pricey 'upscale' tea?" In such a case, even a simple correspondence analysis would have made more sense than an MCA.
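A hedged sketch of that more focused approach, using FactoMineR's `CA()` on a cross-tabulation of age group vs. price category (my own choice of variables for illustration):

```r
#Simple correspondence analysis of age group vs. tea price category
age_price <- as.data.frame.matrix(table(tea$age_Q, tea$price))
my_ca <- CA(age_price, graph = TRUE)
summary(my_ca)
```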

Ironically, the "tea" data is a prime example of where MCA fails as a dimensionality reduction technique and serves to confuse rather than help. The large number of categorical variables fails to reduce to meaningful representations in a comprehensible number of dimensions. And if more dimensions are considered, the results of the MCA become increasingly difficult to understand, perhaps even more so than just considering individual pairs of variables in the data.